{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banner_Top_06.06.18.jpg?raw=true)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "from scrapy.selector import Selector\n",
    "from scrapy.http import HtmlResponse\n",
    "from scrapy import Request\n",
    "\n",
    "import plotly.plotly as py\n",
    "\n",
    "from IPython.core.display import HTML\n",
    "\n",
    "from notebook_code.scraping_support import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Why on earth would you want to scrape a web page?\n",
    "\n",
    "In an ideal world, you wouldn't have to. Everyone would realize that pages they put up on the web contained information that was useful beyond the ways they imagined, and while they built their web pages to present that information the way they wanted to, they'd *also* provide a way to get just that raw data in a way that is easy for machines to process. Things are getting better as everyone realizes the interesting things we can do with the raw data and all levels of government are making their data more open and accessible. But we're not quite there yet, and web scraping is often the necessary glue we need to use to stick things together while we wait.\n",
    "\n",
    "By the end of this notebook, we'll be able to grab the quotations from [Quotes to Scrape](http://quotes.toscrape.com) and print them out directly, without all of the other information and formatting on the page that we don't want. We'll also be able to count up the number of quotes per author and display that information!\n",
    "\n",
    "Here's what the page looks like:\n",
    "\n",
    "![website-to-scrape](images/website-to-scrape.png)\n",
    "\n",
    "\n",
    "## Ethical Scraping\n",
    "\n",
    "It's important to think about what you're doing when you're scraping a website. You're collecting information that someone else, or many people, took a long time to gather. Your scraping code will also access information on their website much more quickly than a human would, which could cause problems for the website if you're not careful. It could also get your IP address blocked, or prompt more serious measures to stop you from hurting their quality of service. You should definitely check for any official notices the website may make discouraging scraping and respect them. Even if you don't see a notice, you should avoid scraping information simply to show exactly the same information on your own site. You should do something meaningful with that information, give credit, and link to its source.\n",
    "\n",
    "You should also be careful about designing your scrapers. Don't leave scrapers running automatically that you're not sure are \"good web citizens\". Also, avoid running them too often. Do you really need to update things every few minutes? Or could you do it once a day? How about once a week? Or only when you rerun a cell in a notebook (like we're doing here)."
   ]
  },
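  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One common machine-readable form of such a notice is a site's `robots.txt` file. As a rough sketch (not part of the original walkthrough), Python's built-in `urllib.robotparser` can read those rules and tell you whether a path is off limits. The rules, the `my-scraper` name, and the `example.com` URLs below are made up for illustration, and the rules are inlined so the example runs without touching the network:\n",
    "\n",
    "```python\n",
    "from urllib.robotparser import RobotFileParser\n",
    "\n",
    "# Hypothetical robots.txt content, inlined so no network access is needed.\n",
    "robots_lines = [\n",
    "    'User-agent: *',\n",
    "    'Disallow: /private/',\n",
    "    'Crawl-delay: 5',\n",
    "]\n",
    "\n",
    "parser = RobotFileParser()\n",
    "parser.parse(robots_lines)\n",
    "\n",
    "# Ask whether our (made-up) scraper may fetch each URL.\n",
    "allowed_page = parser.can_fetch('my-scraper', 'http://example.com/page/1/')\n",
    "allowed_private = parser.can_fetch('my-scraper', 'http://example.com/private/data')\n",
    "# Number of seconds the site asks scrapers to wait between requests.\n",
    "delay = parser.crawl_delay('my-scraper')\n",
    "\n",
    "print(allowed_page, allowed_private, delay)\n",
    "```\n",
    "\n",
    "For a real site you would point the parser at the site's own `robots.txt` with `set_url(...)` and `read()` instead of inlining the rules."
   ]
  },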
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A Crash Course in HTML\n",
    "\n",
    "To properly scrape a page, you first need to know a little about HTML (also known as: hypertext markup language). It's what almost every website you'll find out there is built on.\n",
    "\n",
    "The main thing that makes HTML different from regular text is its use of **tags**. A `<` marks the beginning of a tag and a `>` marks the end of a tag. If you're just using a single tag on its own, it should end with a `/>`. For example, a line break is `<br/>`. Your browser doesn't need to know anything else about it to know it should put an empty line between that tag and whatever comes next.\n",
    "\n",
    "However, most tags are a little more sophisticated than our rather boring `<br/>`. Most tags do *something* to what comes after them. For example, `<b>` tells the browser to start **making everything bold. But eventually bold words just get distracting, so you'll want to stop making them all bold. You do that with `</b>`**.\n",
    "\n",
    "Notice that the only difference between the closing tag and the opening tag is that the closing tag begins with `</` instead of `<`. \n",
    "\n",
    "When your browser is figuring out how to display a page, it expects that you provide the closing tag whenever you use an opening tag. You can put whatever you want in between the opening and closing tags as long as it's valid HTML.\n",
    "\n",
    "For example, `<i>foo</i>` will italicize *foo*. If you want to also make **<i>foo</i>** bold, you could put the valid HTML `<i>foo</i>` inside opening and closing `b` tags: \n",
    "\n",
    "```html\n",
    "<b><i>foo</i></b>\n",
    "```\n",
    "\n",
    "In case you're wondering, yes, you could also get the same result with:\n",
    "\n",
    "```html\n",
    "<i><b>foo</b></i>\n",
    "```\n",
    "\n",
    "That may seem like a lot of work to do something you could do in a couple of mouse clicks or key presses in your favourite word processor. The beauty of HTML is that all web browsers are expected to do the *same* thing when they run into any of the standard HTML tags. That means that the tags you wrote to make your text bold and italicized will result in **<i>bold, italicized</i>** text no matter what browser someone uses to read it.\n",
    "\n",
    "### Valid HTML Documents\n",
    "\n",
    "We've learned what basic valid HTML looks like. What about a whole HTML document? Here's the minimum an HTML document should have in order to be valid HTML5 (the latest version of HTML):\n",
    "\n",
    "```html\n",
    "<!DOCTYPE html>\n",
    "<html lang=\"en\">\n",
    "  <head>\n",
    "    <meta charset=\"utf-8\">\n",
    "    <title>Your page title goes here.</title>\n",
    "  </head>\n",
    "  <body>\n",
    "    Your main page content goes here.\n",
    "  </body>\n",
    "</html>\n",
    "```\n",
    "\n",
    "Web browsers can be pretty forgiving, and your page probably won't break if you miss some of these. Don't worry about trying to memorize this either. You can find it in more than enough places online and it's perfectly okay to cut and paste to start out. Even the pros do that.\n",
    "\n",
    "We'll just be focusing on what happens within the `<body></body>` tags, so for the following examples, just remember that to make them into a valid HTML document, you'd have to put them inside the `<body></body>` tags above."
   ]
  },
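  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see opening and closing tags the way a program sees them, here is a small sketch using Python's built-in `html.parser` module. This is not the library used in the rest of this notebook, and the `TagLogger` class is a name made up for this example. It simply records each tag and piece of text in the order the parser meets them:\n",
    "\n",
    "```python\n",
    "from html.parser import HTMLParser\n",
    "\n",
    "class TagLogger(HTMLParser):\n",
    "    # Hypothetical helper: remembers every opening tag, closing tag and text chunk.\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.events = []\n",
    "\n",
    "    def handle_starttag(self, tag, attrs):\n",
    "        self.events.append('open ' + tag)\n",
    "\n",
    "    def handle_endtag(self, tag):\n",
    "        self.events.append('close ' + tag)\n",
    "\n",
    "    def handle_data(self, data):\n",
    "        self.events.append('text ' + data)\n",
    "\n",
    "logger = TagLogger()\n",
    "logger.feed('<b><i>foo</i></b>')\n",
    "print(logger.events)\n",
    "```\n",
    "\n",
    "The inner `i` tag closes before the outer `b` tag does, which is the nesting rule described above."
   ]
  },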
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A Simple Scraper\n",
    "\n",
    "There's a lot more we could say about HTML and its standard tags, and if you're interested, you can read about those tags [here](https://www.w3schools.com/tags/default.asp). Along with CSS (or cascading style sheets), another standard expected to be supported by every browser, you can make text and images look exactly the way you want in any web browser. And with a third standard language, JavaScript, you can even create sophisticated user interfaces that people can interact with. Bet your word processor can't do that!\n",
    "\n",
    "However, this notebook's purpose is to get you ready to scrape other people's nice looking web pages! And we now know enough about HTML to start doing that. So let's get into it!\n",
    "\n",
    "### Your First Mission\n",
    "\n",
    "Let's say we have the following within our document `<body></body>`:\n",
    "\n",
    "```html\n",
    "The answer to the question is: <b>42</b>.\n",
    "```\n",
    "\n",
    "Here's what that looks like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "html_text = 'The answer to the question is: <b>42</b>.'\n",
    "HTML(html_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How would you get *just* the answer `42`?\n",
    "\n",
    "Well, you *could* just look for some part of the text before and after it. For example, you could look for `'is: '` and grab everything until the first `.`. That's after you remove that bold tag and any other HTML tags within the text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "raw_text = remove_tags(html_text)\n",
    "before_text = 'is: '\n",
    "answer_start = raw_text.find(before_text) + len(before_text)\n",
    "answer_end = raw_text.find('.', answer_start)\n",
    "raw_text[answer_start:answer_end]"
   ]
  },
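  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `remove_tags` helper used above comes from this notebook's support module, and its source isn't shown here. As a minimal stand-in, assuming all it needs to do is drop anything that looks like `<...>`, the whole approach can be sketched with only the standard library. The name `strip_tags` is made up for this example and is not the notebook's actual helper:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def strip_tags(html):\n",
    "    # Hypothetical stand-in for remove_tags: delete anything between < and >.\n",
    "    return re.sub('<[^>]+>', '', html)\n",
    "\n",
    "html_text = 'The answer to the question is: <b>42</b>.'\n",
    "raw_text = strip_tags(html_text)\n",
    "\n",
    "# Find where the marker text ends, then read up to the next full stop.\n",
    "before_text = 'is: '\n",
    "answer_start = raw_text.find(before_text) + len(before_text)\n",
    "answer_end = raw_text.find('.', answer_start)\n",
    "answer = raw_text[answer_start:answer_end]\n",
    "print(answer)\n",
    "```\n",
    "\n",
    "A pattern this simple is fine for a toy string, but it is easily fooled by real pages, which is one more reason to prefer a proper HTML parser."
   ]
  },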
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are some problems with finding the answer this way? Well, first, it's a lot of work! But on top of that, if we change any of the text around the answer, we risk breaking the code we're using to grab it. What if, instead of stripping out the `<b>` tag, we used it as a way to locate the answer? That's where scraping libraries like *scrapy* come in. They can use the structure of the HTML itself to help drill down to the parts of it we're interested in.\n",
    "\n",
    "Here's how we can find our answer using *scrapy*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('//b/text()').extract_first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### XPaths\n",
    "\n",
    "What's going on here? Let's break up that first line above into 3 pieces:\n",
    "\n",
    "1. `Selector(text=html_text)`\n",
    "2. `xpath('//b/text()')`\n",
    "3. `extract_first()`\n",
    "\n",
    "The first piece is just a way of initializing **scrapy** with the HTML that we want it to dig into. We're creating the `Selector` object with our `html_text` string, and that object will do the hard work of figuring out the structure of that text string based on all the HTML tags, etc. in it. The last piece tells **scrapy** that we want to return the first match to our query. We could have used `extract` instead of `extract_first`, and received a list with a single element (`'42'`) that we would then have to pull out of the list.\n",
    "\n",
    "As important as the first and third pieces are, the second piece is where we'll be spending most of our time. This is where you tell **scrapy** what to look for and where to start looking for it.\n",
    "\n",
    "Let's modify our example just a bit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "html_text = '''\n",
    "Answer to Question 1: <b>42</b><br/>\n",
    "Answer to Question 2: <b>2</b><br/>\n",
    "Answer to Question 3: <b>foo</b>\n",
    "'''\n",
    "HTML(html_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How would you get all 3 answers? Note that we've also changed the text that comes before and after the answers. This would completely break our very first code for extracting the answer. How much would you have to change the **scrapy** version?\n",
    "\n",
    "The great thing is you don't have to change either the first or second part at all! The answers are all still within those `<b></b>` tags. But now, instead of just a single match, we want *all* of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('//b/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What if we put those answers inside another HTML element? We're going to use a `<div></div>` element that's kind of like an all-purpose container for things. Think of it as a box that you can put stuff in. That box can go into bigger boxes, and those boxes can be moved around without repositioning the things inside them. Here's what our answers look like inside a `div`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "html_text = '''\n",
    "<div>\n",
    "    Answer to Question 1: <b>42</b><br/>\n",
    "    Answer to Question 2: <b>2</b><br/>\n",
    "    Answer to Question 3: <b>foo</b>\n",
    "</div>\n",
    "'''\n",
    "HTML(html_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nothing *visually* has changed about the text. Right now, our `div` is an invisible box. So, does that change what we need to do to find our answers? Nope!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('//b/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems like there might be just a bit of magic going on with that small bit of text within our `xpath` function call. Let's take a closer look at it. What does `'//b/text()'` mean to **scrapy's** `xpath` function? Just like we did above, let's pull out the pieces:\n",
    "\n",
    "1. `//` and `/`\n",
    "2. `b`\n",
    "3. `text()`\n",
    "\n",
    "Let's talk about parts 2 and 3 first. Those are pretty straightforward. `b` is just the name of the tag, without needing to put in any of the `<>`. `text()` is a special xpath function that tells it to look for a plain text element.\n",
    "\n",
    "`//` and `/` both mean to look within the current tag's children. If nothing is on the left, the \"current tag\" is the root of the document. The difference is that while `/` expects the tag on the right to be a direct child of the tag on the left, `//` will keep checking the children of children, retrieving anything that matches. Neither is better than the other. In fact, you'll likely need to play around a bit with them to find the right combination that is *specific enough* to not return a bunch of results you aren't interested in, but *general enough* that if someone made a small change to the website you're scraping, your scraper wouldn't break.\n",
    "\n",
    "Without any `//` you'd have to write out the exact path to get the same answer as above. Try it!:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('/html/body/div/b/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now try taking out one of those tags (we'll take out the `div` tag):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('/html/body/b/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Where could you add a single `//` to the above xpath string to get those results back?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('/html/body//b/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This means your `<b></b>` tags around the answers can be anywhere inside `<html><body></body></html>`, but not outside of it. If the `<div>` tag was replaced with `<p>` (for 'paragraph') or any other tag, your scraper would still work. If the answers were put another level deeper, the scraper would still find them."
   ]
  },
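  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between a direct child and a child at any depth can also be tried out without scrapy. The sketch below swaps in Python's built-in `xml.etree.ElementTree`, which understands only a small subset of XPath and needs well-formed markup, so treat it as an illustration of the idea rather than a replacement for the selectors above. The markup is a shortened copy of our example:\n",
    "\n",
    "```python\n",
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# ElementTree is an XML parser, so every tag here is properly closed.\n",
    "doc = ET.fromstring(\n",
    "    '<html><body><div>'\n",
    "    'Answer to Question 1: <b>42</b><br/>'\n",
    "    'Answer to Question 2: <b>2</b><br/>'\n",
    "    'Answer to Question 3: <b>foo</b>'\n",
    "    '</div></body></html>'\n",
    ")\n",
    "body = doc.find('body')\n",
    "\n",
    "# Direct children only: body's only child is the div, so nothing matches.\n",
    "direct = [b.text for b in body.findall('b')]\n",
    "\n",
    "# Any depth below body: the b tags inside the div are found.\n",
    "nested = [b.text for b in body.findall('.//b')]\n",
    "\n",
    "print(direct)\n",
    "print(nested)\n",
    "```\n",
    "\n",
    "The first search comes back empty for the same reason `/html/body/b/text()` did, and the second finds all three answers just like `/html/body//b/text()`."
   ]
  },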
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using HTML attributes\n",
    "\n",
    "In addition to children, any HTML tag can have a number of attributes. Here's what they look like:\n",
    "\n",
    "```html\n",
    "<some_tag attribute_1='value_1' attribute_2='value_2'>\n",
    "  ...\n",
    "</some_tag>\n",
    "```\n",
    "\n",
    "One attribute that has special meaning and is used frequently is `class`. `class` is used to tell the browser to use certain CSS style information when drawing the tag and its children. A tag can have many classes, each separated by a space. For example:\n",
    "\n",
    "```html\n",
    "<some_tag class='class_1 class_2 class_3'>\n",
    "  ...\n",
    "</some_tag>\n",
    "```\n",
    "\n",
    "We're not going to get into how CSS works here, though. What's important for our scraper is that these classes are used a lot and often have names that are related to the kind of information they contain. This isn't a rule. They can really be anything you want. But the reason people will often use the type of information as a class name is that they don't want to have to think about the structure of the page and where that tag is just to change some small part of its appearance (its color, for example). And because of this, they can help us make our scrapers more specific and less fragile.\n",
    "\n",
    "Let's go back to our sample HTML and change it a bit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "html_text = '''\n",
    "<div>\n",
    "    Answer to Question 1: <b class='answer'>42</b><br/>\n",
    "    Answer to Question 2: <b class='answer'>2</b><br/>\n",
    "    Answer to Question 3: <b class='answer'>foo</b><br/>\n",
    "    This isn't an answer to anything: <b>Just a little bold text!</b>\n",
    "</div>\n",
    "'''\n",
    "HTML(html_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens if we run the same scraping code we did above on this?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('//b/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Aha! We still get our answers, but we also get something we don't want, just because it happens to be within a `<b></b>` tag. Let's add knowledge of the 'answer' class to our scraper so it just picks up the right pieces:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "xpath_string = '//b[contains(concat(\" \", normalize-space(@class), \" \"),\" answer \")]/text()'\n",
    "Selector(text=html_text).xpath(xpath_string).extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a little complicated. It's handling the possibility that one or more of the `<b>` tags may have *multiple* classes, and it's making sure to not match class names that just contain 'answer' (such as 'not_an_answer'). This is where scrapy's `css` function and the ability to chain your scraping calls comes in handy.\n",
    "\n",
    "Thinking about it another way, we could ask to find all the `<b></b>` tags, then just find the tags within that set that have the class `answer`, then find the text underneath:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('//b').css('.answer').xpath('.//text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `css` function uses [CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) to navigate tags."
   ]
  },
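  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why the longer XPath expression pads the class attribute with spaces, here is the same check written in plain Python. The `has_class` function and the sample class strings are made up for this illustration; the point is only to compare a naive substring test with a whole-word test:\n",
    "\n",
    "```python\n",
    "def has_class(class_attr, name):\n",
    "    # Collapse runs of whitespace, pad both ends with a space,\n",
    "    # then look for the name surrounded by spaces.\n",
    "    padded = ' ' + ' '.join(class_attr.split()) + ' '\n",
    "    return (' ' + name + ' ') in padded\n",
    "\n",
    "samples = ['answer', 'big answer bold', 'not_an_answer']\n",
    "\n",
    "# Naive check: is the text 'answer' anywhere inside the attribute?\n",
    "naive = ['answer' in s for s in samples]\n",
    "\n",
    "# Whole-word check, mirroring the concat/normalize-space idiom.\n",
    "exact = [has_class(s, 'answer') for s in samples]\n",
    "\n",
    "print(naive)\n",
    "print(exact)\n",
    "```\n",
    "\n",
    "The naive test wrongly accepts `not_an_answer`, while the padded version only accepts attributes where `answer` is a class of its own. The `.answer` CSS selector gives you that whole-word behaviour for free."
   ]
  },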
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "There's a lot more to *xpaths* and *CSS selectors* that can help you dig through a page to get just the right information, but you can do a lot with what you've already learned. There's more than one way to get to the information you need and there are no absolute rules about how to do it.\n",
    "\n",
    "Here's some more complicated HTML, similar to what you might see on a real webpage:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "html_text = '''\n",
    "<div class='main'>\n",
    "    <div class='navigation'>\n",
    "        <ul>\n",
    "            <li>Menu item 1</li>\n",
    "            <li>Menu item 2</li>\n",
    "            <li>Menu item 3</li>\n",
    "            <li>Menu item 4</li>\n",
    "        </ul>\n",
    "    </div>\n",
    "    <div class='details'>\n",
    "        <div class='section paragraphs'>\n",
    "            <h1>A bunch of paragraphs</h1>\n",
    "            <p>This is the first paragraph</p>\n",
    "            <p>This is the second paragraph</p>\n",
    "            <p>This is the third paragraph</p>\n",
    "        </div>\n",
    "        <div class='section table'>\n",
    "            <h1>A table of data</h1>\n",
    "            <p>This is in a paragraph tag but shouldn't show up in the exercises</p>\n",
    "            <table class='data'>\n",
    "                <thead>\n",
    "                    <tr><td>A</td><td>B</td><td>C</td></tr>\n",
    "                </thead>\n",
    "                <tbody>\n",
    "                    <tr><td>1</td><td>2</td><td>3</td></tr>\n",
    "                    <tr><td>2</td><td>4</td><td>6</td></tr>\n",
    "                    <tr><td>3</td><td>6</td><td>9</td></tr>\n",
    "                    <tr><td>4</td><td>8</td><td>12</td></tr>\n",
    "                </tbody>\n",
    "            </table>\n",
    "        </div>\n",
    "    </div>\n",
    "</div>\n",
    "'''\n",
    "HTML(html_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1:\n",
    "\n",
    "Extract just the 3 paragraphs under the **A bunch of paragraphs** section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).css('.details .section.paragraphs').xpath('.//p/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2:\n",
    "\n",
    "Extract all of the numbers in column B of the table. Hint: `td[2]` instead of just `td` in your xpath will tell scrapy to always grab the 2nd `td` tag under every `tr` (or row) tag. Using a CSS selector, `td:nth-child(2)` does the same thing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).css('.details table tbody tr td:nth-child(2)').xpath('.//text()').extract()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('//table/tbody/tr/td[2]/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 3:\n",
    "\n",
    "Extract just the headings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).xpath('//h1/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 4:\n",
    "\n",
    "Extract just the menu items."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Selector(text=html_text).css('.navigation').xpath('.//ul/li/text()').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scraping a real website!\n",
    "\n",
    "Now let's get to the best part: scraping a real website and crunching the results. We're going to use a website that's been set up to play with web scraping: [quotes.toscrape.com](http://quotes.toscrape.com)\n",
    "\n",
    "But wait! Up to this point, you've been able to see the HTML code you're scraping in plain text. How do you figure out what xpaths or CSS selectors to use on a real website?\n",
    "\n",
    "### Developer Tools\n",
    "\n",
    "Most web browsers have tools for inspecting HTML in web pages built into them. You just need to figure out how to access those tools. Sometimes you have to turn them on first. They're not something an average user needs, so browsers may hide them to avoid confusion. Here are instructions for turning them on in some popular browsers:\n",
    "\n",
    "1. [Chrome](https://developers.google.com/web/tools/chrome-devtools/)\n",
    "2. [Firefox](https://developer.mozilla.org/en-US/docs/Tools/Settings)\n",
    "3. [Safari](https://support.apple.com/en-ca/guide/safari/use-the-safari-develop-menu-sfri20948/mac)\n",
    "4. [Internet Explorer](https://msdn.microsoft.com/en-us/library/dd565628(v=vs.85).aspx)\n",
    "\n",
    "If you can't access the developer tools in your browser, you can still follow along in this notebook, as we've already captured the important paths. If you want to scrape other pages, though, these tools will help a lot.\n",
    "\n",
    "Here's an example of how you access them via Google Chrome:\n",
    "\n",
    "1. Open Chrome's Menu:\n",
    "  <br/>\n",
    "  <br/>\n",
    "  ![chrome-menu](images/chrome-menu.png)\n",
    "  <br/>\n",
    "  <br/>\n",
    "2. Navigate to \"More Tools\":\n",
    "  <br/>\n",
    "  <br/>\n",
    "  ![chrome-menu](images/more-tools.png)\n",
    "  <br/>\n",
    "  <br/>\n",
    "\n",
    "3. From there, choose \"Developer Tools\":\n",
    "  <br/>\n",
    "  <br/>\n",
    "  ![chrome-menu](images/developer-tools.png)\n",
    "  <br/>\n",
    "  <br/>\n",
    "\n",
    "\n",
    "You should see a screen like this (the developer tools may also show up on the right of your screen or in a separate window):\n",
    "\n",
    "<br/>\n",
    "<br/>\n",
    "![chrome-menu](images/devtools-screen.png)\n",
    "<br/>\n",
    "\n",
    "\n",
    "From there, you can click the button in the upper left corner to enter a mode where any element you select on the web page highlights the corresponding HTML element in your developer tools:\n",
    "<br/>\n",
    "\n",
    "![inspect-element](images/inspect-element.png)\n",
    "\n",
    "<br/>\n",
    "\n",
    "Here's what it looks like when you hover over the quotation text on a page:\n",
    "\n",
    "<br/>\n",
    "\n",
    "![inspect-element](images/quote-span.png)\n",
    "\n",
    "<br/>\n",
    "\n",
    "And here's what your Developer Tools will show:\n",
    "\n",
    "<br/>\n",
    "\n",
    "![inspect-element](images/quote-location.png)\n",
    "\n",
    "We want to get the author and the actual quote for every quotation on the page. Here are the paths you should find to those:\n",
    "\n",
    "```\n",
    "html body div.container div.row div.col-md-8 div.quote span.text\n",
    "html body div.container div.row div.col-md-8 div.quote span small.author\n",
    "```\n",
    "\n",
    "Let's use those paths to tell our scraper where to get the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "response = requests.get(\"http://quotes.toscrape.com/page/1/\")\n",
    "quotes = Selector(text=response.content).css(\".quote\")\n",
    "for quote in quotes:\n",
    "    text = quote.css(\".text\").xpath('.//text()').extract_first()\n",
    "    author = quote.css(\".author\").xpath('.//text()').extract_first()\n",
    "    print(text + \" - \" + author)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Counting Quotes\n",
    "\n",
    "Here's where things start to get interesting. Scraping a single page of quotes is probably not worth your time creating a script (unless you're learning!). If copying and pasting can quickly get you what you want, by all means, do that. But what if we wanted to grab more than a page of quotes so that we had lots of data to crunch? With that, we could find out the most popular authors on the site, or whether any particular words show up more often in quotes.\n",
    "\n",
    "To do this, we're going to have to make a scraper that can navigate the website. As human beings, we would just keep clicking the \"Next\" button until we reached the end:\n",
    "\n",
    "![nav-buttons](images/nav-buttons.png)\n",
    "\n",
    "We can easily make our code do this too. If we can get our scraper to look at the HTML code that creates the \"Next\" button, we can find a link to the next page, and we can pass that, instead of `\"http://quotes.toscrape.com/page/1/\"` to our `requests.get` function.\n",
    "\n",
    "Let's inspect the \"Next\" button by clicking the element selection button in our Developer Tools:\n",
    "\n",
    "![nav-buttons](images/inspect-element.png)\n",
    "\n",
    "And then clicking the \"Next\" button:\n",
    "\n",
    "![nav-buttons](images/next-button.png)\n",
    "\n",
    "\n",
    "If everything goes as planned, you should see something like this:\n",
    "\n",
    "\n",
    "![nav-buttons](images/next-location.png)\n",
    "\n",
    "\n",
    "Now we can see that the exact path to the underlying link in the \"Next\" button is:\n",
    "\n",
    "```\n",
    "html body div.container div.row div.col-md-8 nav ul.pager li.next a\n",
    "```\n",
    "\n",
    "Specifically, we see the link to the next page within the `href` attribute of the `<a></a>` tag. So we know what we need to grab. How are we going to grab it? \n",
    "\n",
    "We could use the entire string above as a CSS selector. However, if we do that, we're counting on the navigation never changing location. But if we looked for every `<a></a>` tag, we'd get a bunch of links we didn't want. What parts of that hierarchy of elements really capture what the \"Next\" button is? Well, it's a navigation button, so it's not too unreasonable to expect that `nav` is likely to be a parent even if a lot of other things change. The `.next` class helps us tell between the \"Next\" and \"Previous\" buttons, so that's important too. We don't really care that it's on a list item (`<li></li>`) so we won't include that in the selector. And to get the `href` attribute, we need an `a` tag. By making sure that the `a` tag has a parent with `.next`, we can be fairly sure we're getting the right one. The creator of the page could certainly put multiple unrelated links under the `.next` parent, but the way the HTML is structured already suggests that they wouldn't be likely to do that.\n",
    "\n",
    "Let's see if we can find the link to page 2 from page 1:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "response = requests.get(\"http://quotes.toscrape.com/page/1/\")\n",
    "Selector(text=response.content).css(\"nav .next a\").xpath(\"@href\").extract_first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! It looks like we'll have to prepend `http://quotes.toscrape.com` to our `href` value before we can actually use it, but that's not a big deal because it never changes. We can keep looking for the \"Next\" button, get the \"href\" value, and stop when we can't find a \"Next\" button.\n",
    "\n",
    "#### Predicting Failure\n",
    "\n",
    "It's good to play a little \"What if?\" before going on. What if we get the wrong link? Or what if the very last page contains a \"Next\" link that just puts us back at page 1? We don't want our scraper to go on forever. Let's pick a number of pages as an absolute maximum before we want our scraper to stop looking for more. We'll use 50 for now.\n",
    "\n",
    "```python\n",
    "stop_value = 50\n",
    "```\n",
    "\n",
    "#### Final scraping code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "stop_value = 50\n",
    "\n",
    "# To get rid of the beginning and ending quotation marks in the actual quotations.\n",
    "def trim_quotation_marks(text):\n",
    "    if text.startswith('“'):\n",
    "        text = text[1:]\n",
    "    if text.endswith('”'):\n",
    "        text = text[:-1]\n",
    "    return text\n",
    "\n",
    "def scrape_pages(path, scraped=0):\n",
    "    print(\"Scraping: \" + path)\n",
    "\n",
    "    response = requests.get(\"http://quotes.toscrape.com\" + path)\n",
    "    \n",
    "    raw_quotes = Selector(text=response.content).css(\".quote\")\n",
    "    quotes = []\n",
    "    \n",
    "    for quote in raw_quotes:\n",
    "        text = quote.css(\".text\").xpath('.//text()').extract_first()\n",
    "        # We already know this is a quotation. Let's get rid of those extra \" marks\n",
    "        text = trim_quotation_marks(text)\n",
    "        author = quote.css(\".author\").xpath('.//text()').extract_first()\n",
    "        quotes.append({'text': text, 'author': author})\n",
    "\n",
    "    next_page = Selector(text=response.content).css(\"nav .next a\").xpath(\"@href\").extract_first()\n",
    "    if next_page and scraped < stop_value:\n",
    "        quotes = quotes + scrape_pages(next_page, scraped + 1)\n",
    "        \n",
    "    return quotes\n",
    "\n",
    "all_quotes = scrape_pages(\"/page/1/\")\n",
    "\n",
    "# We'll show a preview\n",
    "all_quotes[0:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Visualizing Author Contributions\n",
    "\n",
    "Great! Now we have a real dataset! Let's visualize our top authors and words!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import plotly.graph_objs as go\n",
    "from plotly.offline import init_notebook_mode, iplot\n",
    "import pandas as pd\n",
    "\n",
    "authors = []\n",
    "texts = []\n",
    "\n",
    "for quote in all_quotes:\n",
    "    authors.append(quote['author'])\n",
    "    texts.append(quote['text'])\n",
    "\n",
    "quotes = pd.DataFrame({'author':authors, 'text':texts})\n",
    "author_counts = (quotes.groupby('author')\n",
    "                       .count()\n",
    "                       .reset_index()\n",
    "                       .rename(columns={'text':'count'})\n",
    "                       .sort_values(['count'], ascending=False))\n",
    "\n",
    "init_notebook_mode(connected=True)\n",
    "\n",
    "data = [go.Bar(x=author_counts['author'][0:10], y=author_counts['count'][0:10])]\n",
    "\n",
    "iplot(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks like Albert Einstein and J.K. Rowling are favourites on this particular website.\n",
    "\n",
    "How long do you think it would have taken to come up with this answer manually? Counting author contributions on a quotes page may seem trivial, but can you imagine other uses of this sort of analysis? Could it be used to show whether a website focuses more on particular subjects or people to get an idea of how biased it may or may not be?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Visualizing Words in Popular Quotations\n",
    "\n",
    "Do certain words show up more than others in popular quotations? Let's find out!\n",
    "\n",
    "Here's how you can go about it:\n",
    "\n",
    "1. Mash all the text from all the quotations together.\n",
    "2. Remove words we know are really common, like 'a', 'the', 'and', etc. We've created a function, called `interesting_words` to help you with this. It will return a list of words that aren't very common words.\n",
    "3. Just as authors were counted, we can group by our words and count those, then sort them from highest to lowest.\n",
    "\n",
    "Here's what we get:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all_quote_text = \" \".join(quotes['text'])\n",
    "\n",
    "wordset = pd.DataFrame(data={'word':interesting_words(all_quote_text),'count':1})\n",
    "\n",
    "word_counts = (wordset.groupby('word')\n",
    "                      .count()\n",
    "                      .reset_index()\n",
    "                      .sort_values(['count'], ascending=False))\n",
    "word_counts[0:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we can plot those words, just like we plotted the authors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data = [go.Bar(x=reverse_column(word_counts['count'][0:10]), \n",
    "               y=reverse_column(word_counts['word'][0:10]),\n",
    "               orientation = 'h')]\n",
    "\n",
    "iplot(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks like \"love\" wins out, which isn't too surprising. It would be interesting to see the top words in a larger set of quotations, or if different types of quotations (humorous vs. inspiring, for example) had different top words. Feel free to try that on this or other data you can find. You have all the tools!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "You've just covered a lot of ground. You learned a little about HTML tags. You've learned how to grab specific items on a page. And you've learned how you can use information from `<a></a>` tags to navigate to other pages on a website. Finally, you've seen some of the questions you can answer through scraping that would be hard to answer otherwise. Congratulations!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banners_Bottom_06.06.18.jpg?raw=true)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
